Data Wrangling
This section describes the techniques we used to transform the data into a format suitable for analysis. After collecting the tweets we had hundreds of separate .json files, so we wrote a Python script to combine all the files into a single file for each category.
Code that merges multiple JSON files into one JSON file
import glob
import os

# Create the output directory if it does not already exist
if not os.path.exists("json_merged/"):
    os.makedirs("json_merged/")

outfile_name = "json_merged/halftime_show_merged.json"
print("Merging JSON files")

with open(outfile_name, "a") as outfile:
    # Use a raw string so the backslashes in the Windows path are not treated as escapes
    path = r"C:\python27\streaming data\*.json"
    for json_file in glob.glob(path):
        # Append the contents of each input file to the merged output file
        with open(json_file, "r") as open_json_file:
            outfile.write(open_json_file.read())
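Since each downloaded file holds one tweet per line (newline-delimited JSON), a quick way to check that the merge succeeded is to count how many lines of the merged file parse as JSON. The sketch below is a hypothetical helper, not part of our pipeline; pass it the merged file path (e.g. json_merged/halftime_show_merged.json).

```python
import json

def count_tweets(path):
    """Count parseable JSON lines (tweets) and malformed lines in a merged file."""
    valid = 0
    invalid = 0
    with open(path, "r") as merged:
        for line in merged:
            line = line.strip()
            if not line:
                continue  # skip blank lines between concatenated files
            try:
                json.loads(line)
                valid += 1
            except ValueError:
                invalid += 1
    return valid, invalid
```

A large `invalid` count would suggest that some input files were not newline-delimited, in which case tweets on the seam between two files could be lost.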
Now the data is a little easier to handle: a few hundred files are down to four. However, we still need to pull out the information we need and convert it to CSV format. To do this we wrote another short Python script that loads our large JSON file and parses each tweet for the data we are looking for.
Code that writes each line (tweet) from JSON file to a CSV file in CSV format
import csv
import json

# Open the output CSV; a raw string keeps the Windows path intact
with open(r"C:\python27\scripts\json_merged\halftime_show.csv", "a") as out_file:
    # Create the CSV writer (named so it does not shadow the csv module)
    writer = csv.writer(out_file)
    # Write the header row; the columns match the fields extracted below
    writer.writerow(['tweet_id', 'tweet_time', 'tweet_author',
                     'tweet_author_location', 'tweet_author_id',
                     'tweet_language', 'tweet_geo', 'tweet_text',
                     'tweet_timestamp_ms'])
    # Open the merged JSON file and parse each line (one tweet per line)
    with open(r"C:\python27\scripts\json_merged\halftime_show_merged.json", "r") as open_json_file:
        for line in open_json_file:
            try:
                tweet = json.loads(line)
                # row holds the attributes we pull from each tweet
                row = (
                    tweet['id'],                  # tweet_id
                    tweet['created_at'],          # tweet_time
                    tweet['user']['screen_name'], # tweet_author
                    tweet['user']['location'],    # tweet_author_location
                    tweet['user']['id_str'],      # tweet_author_id
                    tweet['lang'],                # tweet_language
                    tweet['geo'],                 # tweet_geo
                    tweet['text'],                # tweet_text
                    tweet['timestamp_ms'],        # tweet_timestamp_ms
                )
                # Encode unicode strings to UTF-8 bytes for Python 2's csv writer
                values = [value.encode('utf8') if hasattr(value, 'encode') else value
                          for value in row]
                writer.writerow(values)
            except (ValueError, KeyError):
                # Skip lines that are not valid JSON or lack an expected field
                pass
After running this script the tweets are in a clean CSV file that can be imported into RStudio for analysis.
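Before importing the file into RStudio, it can help to confirm from Python that the header and rows came out as expected. The helper below is an illustrative sketch, not part of our scripts; it reads back the header and the first few rows of any CSV file.

```python
import csv

def preview_csv(path, n=3):
    """Return the header (list of column names) and the first n data rows."""
    with open(path, "r") as f:
        reader = csv.DictReader(f)
        rows = []
        for i, row in enumerate(reader):
            if i >= n:
                break
            rows.append(row)
        return reader.fieldnames, rows
```

If the column names or row counts look wrong here, they will also be wrong after the import into R, so this catches problems early.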